Computer Vision and Pattern Recognition Academic Digest [12.24]

MrGreen, arXiv Daily Academic Digest, 2022-04-26

cs.CV: 53 papers today

Transformer (4 papers)

【1】 ELSA: Enhanced Local Self-Attention for Vision Transformer
Link: https://arxiv.org/abs/2112.12786

Authors: Jingkai Zhou, Pichao Wang, Fan Wang, Qiong Liu, Hao Li, Rong Jin
Affiliations: South China University of Technology, Alibaba Group
Notes: Project at this https URL
Abstract: Self-attention is powerful in modeling long-range dependencies, but it is weak in local finer-level feature learning. The performance of local self-attention (LSA) is just on par with convolution and inferior to dynamic filters, which puzzles researchers on whether to use LSA or its counterparts, which one is better, and what makes LSA mediocre. To clarify these, we comprehensively investigate LSA and its counterparts from two sides: channel setting and spatial processing. We find that the devil lies in the generation and application of spatial attention, where relative position embeddings and the neighboring filter application are key factors. Based on these findings, we propose the enhanced local self-attention (ELSA) with Hadamard attention and the ghost head. Hadamard attention introduces the Hadamard product to efficiently generate attention in the neighboring case, while maintaining the high-order mapping. The ghost head combines attention maps with static matrices to increase channel capacity. Experiments demonstrate the effectiveness of ELSA. Without architecture or hyperparameter modification, drop-in replacing LSA with ELSA boosts Swin Transformer by up to +1.4 on top-1 accuracy. ELSA also consistently benefits VOLO from D1 to D5, where ELSA-VOLO-D5 achieves 87.2 on ImageNet-1K without extra training images. In addition, we evaluate ELSA in downstream tasks. ELSA significantly improves the baseline by up to +1.9 box AP / +1.3 mask AP on COCO, and by up to +1.9 mIoU on ADE20K. Code is available at https://github.com/damo-cv/ELSA.

【2】 SeMask: Semantically Masked Transformers for Semantic Segmentation
Link: https://arxiv.org/abs/2112.12782

Authors: Jitesh Jain, Anukriti Singh, Nikita Orlov, Zilong Huang, Jiachen Li, Steven Walton, Humphrey Shi
Affiliations: Picsart AI Research (PAIR), IIT Roorkee
Notes: 13 pages, 6 figures
Abstract: Finetuning a pretrained backbone in the encoder part of an image transformer network has been the traditional approach for the semantic segmentation task. However, such an approach leaves out the semantic context that an image provides during the encoding stage. This paper argues that incorporating semantic information of the image into pretrained hierarchical transformer-based backbones while finetuning improves the performance considerably. To achieve this, we propose SeMask, a simple and effective framework that incorporates semantic information into the encoder with the help of a semantic attention operation. In addition, we use a lightweight semantic decoder during training to provide supervision to the intermediate semantic prior maps at every stage. Our experiments demonstrate that incorporating semantic priors enhances the performance of the established hierarchical encoders with a slight increase in the number of FLOPs. We provide empirical proof by integrating SeMask into each variant of the Swin-Transformer as our encoder paired with different decoders. Our framework achieves a new state-of-the-art of 58.22% mIoU on the ADE20K dataset and improvements of over 3% in the mIoU metric on the Cityscapes dataset. The code and checkpoints are publicly available at https://github.com/Picsart-AI-Research/SeMask-Segmentation.
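
For readers unfamiliar with the mIoU figure quoted in the abstract above, the following is a minimal sketch of how mean intersection-over-union is commonly computed over flat label lists. The function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration; the SeMask evaluation code may differ.

```python
def mean_iou(pred, gt, num_classes):
    """Mean IoU over classes for two equal-length lists of integer labels.

    Hypothetical helper: classes that appear in neither list are skipped,
    which is one common convention, not necessarily the paper's.
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)


if __name__ == "__main__":
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

In the toy call, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.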

【3】 LaTr: Layout-Aware Transformer for Scene-Text VQA
Link: https://arxiv.org/abs/2112.12494

Authors: Ali Furkan Biten, Ron Litman, Yusheng Xie, Srikar Appalaraju, R. Manmatha
Affiliations: Computer Vision Center, UAB, Spain; Amazon Web Services
Abstract: We propose a novel multimodal architecture for Scene Text Visual Question Answering (STVQA), named Layout-Aware Transformer (LaTr). The task of STVQA requires models to reason over different modalities. Thus, we first investigate the impact of each modality, and reveal the importance of the language module, especially when enriched with layout information. Accounting for this, we propose a single objective pre-training scheme that requires only text and spatial cues. We show that applying this pre-training scheme on scanned documents has certain advantages over using natural images, despite the domain gap. Scanned documents are easy to procure, text-dense and have a variety of layouts, helping the model learn various spatial cues (e.g. left-of, below, etc.) by tying together language and layout information. Compared to existing approaches, our method performs vocabulary-free decoding and, as shown, generalizes well beyond the training vocabulary. We further demonstrate that LaTr improves robustness towards OCR errors, a common reason for failure cases in STVQA. In addition, by leveraging a vision transformer, we eliminate the need for an external object detector. LaTr outperforms state-of-the-art STVQA methods on multiple datasets. In particular, it gains +7.6% on TextVQA, +10.8% on ST-VQA and +4.0% on OCR-VQA (all absolute accuracy numbers).

【4】 Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding
Link: https://arxiv.org/abs/2112.12180

Authors: Tanay Agrawal, Dhruv Agarwal, Michal Balazia, Neelabh Sinha, Francois Bremond
Affiliations: INRIA Sophia Antipolis - Méditerranée, France; Université Côte d'Azur, France; Indian Institute of Information Technology, Allahabad, India; Birla Institute of Technology and Science, Pilani, India
Notes: Preprint. Final paper accepted at the 17th International Conference on Computer Vision Theory and Applications, VISAPP 2021, Virtual, February 6-8, 2022. 8 pages
Abstract: Personality computing and affective computing have gained recent interest in many research areas. The datasets for the task generally have multiple modalities like video, audio, language and bio-signals. In this paper, we propose a flexible model for the task which exploits all available data. The task involves complex relations, and to avoid using a large model for video processing specifically, we propose the use of behaviour encoding, which boosts performance with minimal change to the model. Cross-attention using transformers has become popular in recent times and is utilised for fusion of different modalities. Since long-term relations may exist, breaking the input into chunks is not desirable; thus the proposed model processes the entire input together. Our experiments show the importance of each of the above contributions.
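
The abstract above mentions cross-attention for fusing modalities. As a rough illustration only, here is a minimal sketch of generic scaled dot-product cross-attention, where queries come from one modality and keys/values from another. It omits the learned projection matrices and multi-head structure of a real transformer layer, and the function names and toy inputs are assumptions, not the authors' model.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention without learned projections.

    queries: list of d-dim vectors from modality A (e.g. audio frames).
    keys, values: lists of vectors from modality B (e.g. video frames).
    Returns one fused vector per query.
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        fused.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return fused


if __name__ == "__main__":
    print(cross_attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]))
```

With a zero query, both keys score equally, so the output is the plain average of the two value vectors.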

Detection (5 papers)

【1】 Towards Universal GAN Image Detection
Link: https://arxiv.org/abs/2112.12606

Authors: Davide Cozzolino, Diego Gragnaniello, Giovanni Poggi, Luisa Verdoliva
Affiliations: University Federico II of Naples, Italy
Abstract: The ever higher quality and wide diffusion of fake images have spawned a quest for reliable forensic tools. Many GAN image detectors have been proposed recently. In real-world scenarios, however, most of them show limited robustness and generalization ability. Moreover, they often rely on side information not available at test time, that is, they are not universal. We investigate these problems and propose a new GAN image detector based on a limited sub-sampling architecture and a suitable contrastive learning paradigm. Experiments carried out in challenging conditions prove the proposed method to be a first step towards universal GAN image detection, ensuring also good robustness to common image impairments, and good generalization to unseen architectures.

【2】 Data-efficient learning for 3D mirror symmetry detection
Link: https://arxiv.org/abs/2112.12579

Authors: Yancong Lin, Silvia-Laura Pintea, Jan van Gemert
Affiliations: Computer Vision Lab, Delft University of Technology, the Netherlands
Notes: Technical report
Abstract: We introduce a geometry-inspired deep learning method for detecting 3D mirror planes from single-view images. We reduce the demand for massive training data by explicitly adding 3D mirror geometry into learning as an inductive prior. We extract semantic features, calculate intra-pixel correlations, and build a 3D correlation volume for each plane. The correlation volume indicates the extent to which the input resembles its mirrors at various depths, allowing us to identify the likelihood of the given plane being a mirror plane. Subsequently, we treat the correlation volumes as feature descriptors for sampled planes and map them to a unit hemisphere where the normals of the sampled planes lie. Lastly, we design multi-stage spherical convolutions to identify the optimal mirror plane in a coarse-to-fine manner. Experiments on both synthetic and real-world datasets show the benefit of 3D mirror geometry in improving data efficiency and inference speed (up to 25 FPS).

【3】 Robust and Precise Facial Landmark Detection by Self-Calibrated Pose Attention Network
Link: https://arxiv.org/abs/2112.12328

Authors: Jun Wan, Hui Xi, Jie Zhou, Zhihui Lai, Witold Pedrycz, Xu Wang, Hang Sun
Affiliations: ... and also with the Shenzhen Institute of Artificial Intelligence and Robotics for Society
Notes: Accepted by IEEE Transactions on Cybernetics, December 2021
Abstract: Current fully-supervised facial landmark detection methods have progressed rapidly and achieved remarkable performance. However, they still suffer when coping with faces under large poses and heavy occlusions, due to inaccurate facial shape constraints and insufficient labeled training samples. In this paper, we propose a semi-supervised framework, i.e., a Self-Calibrated Pose Attention Network (SCPAN), to achieve more robust and precise facial landmark detection in challenging scenarios. To be specific, a Boundary-Aware Landmark Intensity (BALI) field is proposed to model more effective facial shape constraints by fusing boundary and landmark intensity field information. Moreover, a Self-Calibrated Pose Attention (SCPA) model is designed to provide a self-learned objective function that enforces intermediate supervision without label information by introducing a self-calibrated mechanism and a pose attention mask. We show that by integrating the BALI fields and SCPA model into a novel self-calibrated pose attention network, more facial prior knowledge can be learned, and the detection accuracy and robustness of our method for faces with large poses and heavy occlusions are improved. The experimental results obtained for challenging benchmark datasets demonstrate that our approach outperforms state-of-the-art methods in the literature.

【4】 Leveraging Synthetic Data in Object Detection on Unmanned Aerial Vehicles
Link: https://arxiv.org/abs/2112.12252

Authors: Benjamin Kiefer, David Ott, Andreas Zell
Affiliations: University of Tuebingen
Notes: The first two authors contributed equally. GitHub repository will be made public soon
Abstract: Acquiring data to train deep learning-based object detectors on Unmanned Aerial Vehicles (UAVs) is expensive, time-consuming and may even be prohibited by law in specific environments. On the other hand, synthetic data is fast and cheap to access. In this work, we explore the potential use of synthetic data in object detection from UAVs across various application environments. For that, we extend the open-source framework DeepGTAV to work for UAV scenarios. We capture various large-scale high-resolution synthetic data sets in several domains to demonstrate their use in real-world object detection from UAVs by analyzing multiple training strategies across several models. Furthermore, we analyze several different data generation and sampling parameters to provide actionable engineering advice for further scientific research. The DeepGTAV framework is available at https://git.io/Jyf5j.

【5】 Improved 2D Keypoint Detection in Out-of-Balance and Fall Situations -- combining input rotations and a kinematic model
Link: https://arxiv.org/abs/2112.12193

Authors: Michael Zwölfer, Dieter Heinrich, Kurt Schindelwig, Bastian Wandt, Helge Rhodin, Joerg Spoerri, Werner Nachbauer
Affiliations: University of Innsbruck, University of British Columbia, University of Zurich
Notes: extended abstract, 4 pages, 3 figures, 2 tables
Abstract: Injury analysis may be one of the most beneficial applications of deep learning based human pose estimation. To facilitate further research on this topic, we provide an injury-specific 2D dataset for alpine skiing, covering 533 images in total. We further propose a post-processing routine that combines rotational information with a simple kinematic model. We could improve detection results in fall situations by up to 21% regarding the PCK@0.2 metric.
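
The PCK@0.2 metric named in the abstract above counts a predicted keypoint as correct when it lies within 0.2 times a reference length of the ground truth. A minimal sketch follows; the abstract does not say which reference length is used (torso size and bounding-box size are both common choices), so `ref_length` is a caller-supplied assumption and `pck` is a hypothetical helper name.

```python
import math


def pck(pred, gt, ref_length, alpha=0.2):
    """Percentage of Correct Keypoints at threshold alpha * ref_length.

    pred, gt: equal-length lists of (x, y) keypoints.
    ref_length: normalising length in pixels (assumed, e.g. torso size).
    Returns the fraction of keypoints within the threshold, in [0, 1].
    """
    threshold = alpha * ref_length
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(gt)


if __name__ == "__main__":
    pred = [(0, 0), (10, 10), (5, 5)]
    gt = [(0, 1), (10, 20), (5, 5)]
    print(pck(pred, gt, ref_length=10))
```

In the toy call the threshold is 2 pixels; the three errors are 1, 10 and 0 pixels, so two of three keypoints count as correct.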

Classification | Recognition (9 papers)

【1】 Assessing the Impact of Attention and Self-Attention Mechanisms on the Classification of Skin Lesions
Link: https://arxiv.org/abs/2112.12748

Authors: Rafael Pedro, Arlindo L. Oliveira
Affiliations: Lisbon, Portugal, INESC-ID Instituto Superior Técnico
Abstract: Attention mechanisms have raised significant interest in the research community, since they promise significant improvements in the performance of neural network architectures. However, in any specific problem, we still lack a principled way to choose specific mechanisms and hyper-parameters that lead to guaranteed improvements. More recently, self-attention has been proposed and widely used in transformer-like architectures, leading to significant breakthroughs in some applications. In this work we focus on two forms of attention mechanisms: attention modules and self-attention. Attention modules are used to reweight the features of each layer's input tensor. Different modules have different ways to perform this reweighting in fully connected or convolutional layers. The attention models studied are completely modular, and in this work they are used with the popular ResNet architecture. Self-attention, originally proposed in the area of Natural Language Processing, makes it possible to relate all the items in an input sequence. Self-attention is becoming increasingly popular in Computer Vision, where it is sometimes combined with convolutional layers, although some recent architectures do away entirely with convolutions. In this work, we study and perform an objective comparison of a number of different attention mechanisms in a specific computer vision task, the classification of samples in the widely used Skin Cancer MNIST dataset. The results show that attention modules do sometimes improve the performance of convolutional neural network architectures, but also that this improvement, although noticeable and statistically significant, is not consistent in different settings. The results obtained with self-attention mechanisms, on the other hand, show consistent and significant improvements, leading to the best results even in architectures with a reduced number of parameters.

【2】 3D Skeleton-based Few-shot Action Recognition with JEANIE is not so Naïve
Link: https://arxiv.org/abs/2112.12668

Authors: Lei Wang, Jun Liu, Piotr Koniusz
Affiliations: The Australian National University; Singapore University of Technology and Design; Data, CSIRO
Notes: Full 17 page version
Abstract: In this paper, we propose a Few-shot Learning pipeline for 3D skeleton-based action recognition by Joint tEmporal and cAmera viewpoiNt alIgnmEnt (JEANIE). To factor out misalignment between query and support sequences of 3D body joints, we propose an advanced variant of Dynamic Time Warping which jointly models each smooth path between the query and support frames to achieve simultaneously the best alignment in the temporal and simulated camera viewpoint spaces for end-to-end learning under the limited few-shot training data. Sequences are encoded with a temporal block encoder based on Simple Spectral Graph Convolution, a lightweight linear Graph Neural Network backbone (we also include a setting with a transformer). Finally, we propose a similarity-based loss which encourages the alignment of sequences of the same class while preventing the alignment of unrelated sequences. We demonstrate state-of-the-art results on NTU-60, NTU-120, Kinetics-skeleton and UWA3D Multiview Activity II.
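
JEANIE is described above as an advanced variant of Dynamic Time Warping. For background, this is a minimal sketch of the classic, purely temporal DTW recurrence on 1-D sequences. It is not the JEANIE algorithm: there is no viewpoint dimension, no smoothness constraint and no differentiable relaxation, and the absolute-difference frame cost is an assumption chosen for simplicity.

```python
import math


def dtw_distance(a, b):
    """Classic dynamic time warping cost between two 1-D sequences."""
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a, step in b, step in both.
            cost[i][j] = frame_cost + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


if __name__ == "__main__":
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

The toy call aligns a sequence with a time-stretched copy of itself, so the warping cost is zero even though the lengths differ.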

【3】 FedFR: Joint Optimization Federated Framework for Generic and Personalized Face Recognition
Link: https://arxiv.org/abs/2112.12496

Authors: Chih-Ting Liu, Chien-Yi Wang, Shao-Yi Chien, Shang-Hong Lai
Affiliations: Graduate Institute of Electronics and Engineering, National Taiwan University; Microsoft AI R&D Center, Taiwan
Notes: Accepted at the AAAI 2022 Conference on Artificial Intelligence
Abstract: Current state-of-the-art deep learning based face recognition (FR) models require a large number of face identities for central training. However, due to growing privacy awareness, it is prohibited to access the face images on user devices to continually improve face recognition models. Federated Learning (FL) is a technique to address the privacy issue, which can collaboratively optimize the model without sharing the data between clients. In this work, we propose an FL-based framework called FedFR to improve the generic face representation in a privacy-aware manner. Besides, the framework jointly optimizes personalized models for the corresponding clients via the proposed Decoupled Feature Customization module. The client-specific personalized model can serve the need for an optimized face recognition experience for registered identities at the local device. To the best of our knowledge, we are the first to explore personalized face recognition in an FL setup. The proposed framework is validated to be superior to previous approaches on several generic and personalized face recognition benchmarks with diverse FL scenarios. The source code and our proposed personalized FR benchmark under the FL setup are available at https://github.com/jackie840129/FedFR.

【4】 Your Face Mirrors Your Deepest Beliefs-Predicting Personality and Morals through Facial Emotion Recognition
Link: https://arxiv.org/abs/2112.12455

Authors: P. A. Gloor, A. Fronzetti Colladon, E. Altuntas, C. Cetinkaya, M. F. Kaiser, L. Ripperger, T. Schaefer
Affiliations: Department of Data Science, Lucerne University of Applied Sciences and Arts
Abstract: Can we really "read the mind in the eyes"? Moreover, can AI assist us in this task? This paper answers these two questions by introducing a machine learning system that predicts personality characteristics of individuals on the basis of their face. It does so by tracking the emotional response of the individual's face through facial emotion recognition (FER) while they watch a series of 15 short videos of different genres. To calibrate the system, we invited 85 people to watch the videos, while their emotional responses were analyzed through their facial expressions. At the same time, these individuals also took four well-validated surveys of personality characteristics and moral values: the revised NEO FFI personality inventory, the Haidt moral foundations test, the Schwartz personal value system, and the domain-specific risk-taking scale (DOSPERT). We found that personality characteristics and moral values of an individual can be predicted through their emotional response to the videos as shown in their face, with an accuracy of up to 86% using gradient-boosted trees. We also found that different personality characteristics are better predicted by different videos; in other words, there is no single video that will provide accurate predictions for all personality characteristics, but it is the response to the mix of different videos that allows for accurate prediction.

【5】 InstaIndoor and Multi-modal Deep Learning for Indoor Scene Recognition
Link: https://arxiv.org/abs/2112.12409

Authors: Andreea Glavan, Estefania Talavera
Abstract: Indoor scene recognition is a growing field with great potential for behaviour understanding, robot localization, and elderly monitoring, among others. In this study, we approach the task of scene recognition from a novel standpoint, using multi-modal learning and video data gathered from social media. The accessibility and variety of social media videos can provide realistic data for modern scene recognition techniques and applications. We propose a model based on fusion of transcribed speech to text and visual features, which is used for classification on a novel dataset of social media videos of indoor scenes named InstaIndoor. Our model achieves up to 70% accuracy and 0.7 F1-Score. Furthermore, we highlight the potential of our approach by benchmarking on a YouTube-8M subset of indoor scenes as well, where it achieves 74% accuracy and 0.74 F1-Score. We hope the contributions of this work pave the way to novel research in the challenging field of indoor scene recognition.

【6】 Human Activity Recognition on wrist-worn accelerometers using self-supervised neural networks
Link: https://arxiv.org/abs/2112.12272

Authors: Niranjan Sridhar, Lance Myers
Affiliations: Verily Life Sciences, LLC (Alphabet), South San Francisco, California, USA
Abstract: Measures of Activity of Daily Living (ADL) are an important indicator of overall health but difficult to measure in-clinic. Automated and accurate human activity recognition (HAR) using wrist-worn accelerometers enables practical and cost-efficient remote monitoring of ADL. Key obstacles in developing high-quality HAR are the lack of large labeled datasets and the performance loss when applying models trained on small curated datasets to the continuous stream of heterogeneous data in real life. In this work we design a self-supervised learning paradigm to create a robust representation of accelerometer data that can generalize across devices and subjects. We demonstrate that this representation can separate activities of daily living and achieve strong HAR accuracy (on multiple benchmark datasets) using very few labels. We also propose a segmentation algorithm which can identify segments of salient activity and boost HAR accuracy on continuous real-life data.

【7】 MC-DGCNN: A Novel DNN Architecture for Multi-Category Point Set Classification
Link: https://arxiv.org/abs/2112.12219

Authors: Majid Farhadloo, Carl Molnar, Gaoxiang Luo, Yan Li, Shashi Shekhar, Rachel L. Maus, Svetomir N. Markovic, Raymond Moore, Alexey Leontovich
Abstract: Point set classification aims to build a representation learning model that distinguishes between spatial and categorical configurations of point set data. This problem is societally important in many application domains, such as immunology and microbial ecology. This problem is challenging since the interactions between different categories of points are not always equal; as a result, the representation learning model must selectively learn the most relevant multi-categorical relationships. Related work (1) is limited in learning the importance of different multi-categorical relationships, especially for high-order interactions, and (2) does not fully exploit the spatial distribution of points beyond simply measuring relative distance or applying a feed-forward neural network to coordinates. To overcome these limitations, we leverage the dynamic graph convolutional neural network (DGCNN) architecture to design a novel multi-category DGCNN (MC-DGCNN), contributing location representation and point pair attention layers for multi-categorical point set classification. MC-DGCNN has the ability to identify the categorical importance of each point pair and extends this to N-way spatial relationships, while still preserving all the properties and benefits of DGCNN (e.g., differentiability). Experimental results show that the proposed architecture is computationally efficient and significantly outperforms current deep learning architectures on real-world datasets.

【8】 Recur, Attend or Convolve? Frame Dependency Modeling Matters for Cross-Domain Robustness in Action Recognition
Link: https://arxiv.org/abs/2112.12175

Authors: Sofia Broomé, Ernest Pokropek, Boyu Li, Hedvig Kjellström
Affiliations: Silo AI, Sweden
Abstract: Most action recognition models today are highly parameterized, and evaluated on datasets with predominantly spatially distinct classes. Previous results for single images have shown that 2D Convolutional Neural Networks (CNNs) tend to be biased toward texture rather than shape for various computer vision tasks (Geirhos et al., 2019), reducing generalization. Taken together, this raises suspicion that large video models learn spurious correlations rather than tracking relevant shapes over time and inferring generalizable semantics from their movement. A natural way to avoid parameter explosion when learning visual patterns over time is to make use of recurrence across the time axis. In this article, we empirically study the cross-domain robustness of recurrent, attention-based and convolutional video models, respectively, to investigate whether this robustness is influenced by the frame dependency modeling. Our novel Temporal Shape dataset is proposed as a light-weight dataset to assess the ability to generalize across temporal shapes which are not revealed from single frames. We find that when controlling for performance and layer structure, recurrent models show better out-of-domain generalization ability on the Temporal Shape dataset than convolution- and attention-based models. Moreover, our experiments indicate that convolution- and attention-based models exhibit more texture bias on Diving48 than recurrent models.

【9】 KFWC: A Knowledge-Driven Deep Learning Model for Fine-grained Classification of Wet-AMD
Link: https://arxiv.org/abs/2112.12386

Authors: Haihong E, Jiawen He, Tianyi Hu, Lifei Wang, Lifei Yuan, Ruru Zhang, Meina Song
Affiliations: Beijing University of Posts and Telecommunications; Hebei Eye Hospital; Education Department Information Network Engineering Research Center (Beijing University of Posts and Telecommunications)
Abstract: Automated diagnosis using deep neural networks can help ophthalmologists detect the blinding eye disease wet Age-related Macular Degeneration (AMD). Wet-AMD has two similar subtypes, Neovascular AMD and Polypoidal Choroidal Vessels (PCV). However, due to the difficulty of data collection and the similarity between images, most studies have only achieved coarse-grained classification of wet-AMD rather than finer-grained classification of wet-AMD subtypes. To solve this issue, we propose a Knowledge-driven Fine-grained Wet-AMD Classification Model (KFWC) to classify fine-grained diseases with insufficient data. By introducing a priori knowledge of 10 lesion signs of the input images into the KFWC, we aim to accelerate the KFWC by means of multi-label classification pre-training, to locate the decisive image features in the fine-grained disease classification task, and therefore to achieve better classification. At the same time, the KFWC provides good interpretability and effectively alleviates the pressure of data collection and annotation in fine-grained disease classification for wet-AMD. The experiments demonstrate the effectiveness of the KFWC, which reaches 99.71% in AU-ROC scores, and show considerable improvements over the data-driven variant without knowledge and over ophthalmologists: 6.69% over the strongest baseline and 4.14% over ophthalmologists.

Segmentation | Semantics (6 papers)

【1】 TagLab: A human-centric AI system for interactive semantic segmentation
Link: https://arxiv.org/abs/2112.12702

Authors: Gaia Pavoni, Massimiliano Corsini, Federico Ponchio, Alessandro Muntoni, Paolo Cignoni
Affiliations: Visual Computing Lab, ISTI-CNR, Pisa, Italy
Comments: Accepted at the Human Centered AI workshop at NeurIPS 2021
Abstract: Fully automatic semantic segmentation of highly specific semantic classes and complex shapes may not meet the accuracy standards demanded by scientists. In such cases, human-centered AI solutions, which assist operators while preserving human control over complex tasks, are a good trade-off for speeding up image labeling while maintaining high accuracy. TagLab is open-source AI-assisted software for annotating large orthoimages that takes advantage of different degrees of automation: it speeds up image annotation from scratch through assisted tools, creates custom fully automatic semantic segmentation models, and, finally, allows quick editing of automatic predictions. Since orthoimage analysis applies to several scientific disciplines, TagLab has been designed with a flexible labeling pipeline. We report our results in two different scenarios: marine ecology and architectural heritage.

【2】 FourierMask: Instance Segmentation using Fourier Mapping in Implicit Neural Networks
Link: https://arxiv.org/abs/2112.12535

Authors: Hamd ul Moqeet Riaz, Nuri Benbarka, Timon Hoeffer, Andreas Zell
Affiliations: Department of Computer Science (WSI), University of Tuebingen, Germany
Abstract: We present FourierMask, which employs Fourier series combined with implicit neural representations to generate instance segmentation masks. We apply a Fourier mapping (FM) to the coordinate locations and use the mapped features as inputs to an implicit representation (a coordinate-based multi-layer perceptron (MLP)). FourierMask learns to predict the coefficients of the FM for a particular instance, and therefore adapts the FM to a specific object. This allows FourierMask to be generalized to predict instance segmentation masks from natural images. Since implicit functions are continuous in the domain of input coordinates, we illustrate that by sub-sampling the input pixel coordinates, we can generate higher-resolution masks during inference. Furthermore, we train a renderer MLP (FourierRend) on the uncertain predictions of FourierMask and illustrate that it significantly improves the quality of the masks. FourierMask shows competitive results on the MS COCO dataset compared to the baseline Mask R-CNN at the same output resolution and surpasses it at higher resolutions.
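A minimal sketch of the coordinate mapping idea described in this abstract, not the authors' implementation. It assumes fixed octave-spaced frequencies, whereas the paper predicts the mapping coefficients per instance. The names `fourier_map` and `sample_grid` are hypothetical. It illustrates why a coordinate-based representation can be queried at any resolution: only the sampling grid changes.

```python
import math


def fourier_map(x, y, num_freqs=4):
    """Map a 2D coordinate in [0, 1] to sin/cos features.

    Assumption: fixed frequencies 2^k * pi; the paper instead learns
    instance-specific coefficients.
    """
    feats = []
    for k in range(num_freqs):
        w = (2.0 ** k) * math.pi
        for v in (x, y):
            feats.append(math.sin(w * v))
            feats.append(math.cos(w * v))
    return feats


def sample_grid(height, width):
    """Return normalized pixel-center coordinates for a height x width grid."""
    return [((c + 0.5) / width, (r + 0.5) / height)
            for r in range(height) for c in range(width)]


# The same mapping serves a coarse grid and a denser one; an MLP taking
# these features as input could therefore be evaluated at either resolution.
coarse = [fourier_map(x, y) for x, y in sample_grid(4, 4)]
fine = [fourier_map(x, y) for x, y in sample_grid(8, 8)]
```

Each coordinate yields `4 * num_freqs` features, so the input width of the downstream MLP depends only on the number of frequencies, not on the grid size.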

【3】 Iteratively Selecting an Easy Reference Frame Makes Unsupervised Video Object Segmentation Easier
Link: https://arxiv.org/abs/2112.12402

Authors: Youngjo Lee, Hongje Seong, Euntai Kim
Affiliations: School of Electrical and Electronic Engineering, Yonsei University, Seoul, Korea
Comments: Accepted to AAAI 2022
Abstract: Unsupervised video object segmentation (UVOS) is a per-pixel binary labeling problem that aims to separate the foreground object from the background in a video without using the ground truth (GT) mask of the foreground object. Most previous UVOS models use the first frame or the entire video as a reference frame to specify the mask of the foreground object. Our question is why the first frame should be selected as the reference frame, or why the entire video should be used to specify the mask. We believe that we can select a better reference frame and achieve better UVOS performance than by using only the first frame or the entire video as a reference. In this paper, we propose the Easy Frame Selector (EFS). The EFS enables us to select an 'easy' reference frame that makes the subsequent VOS easy, thereby improving VOS performance. Furthermore, we propose a new framework named Iterative Mask Prediction (IMP). In this framework, we repeatedly apply EFS to the given video and select an 'easier' reference frame than in the previous iteration, increasing VOS performance incrementally. The IMP consists of EFS, Bi-directional Mask Prediction (BMP), and Temporal Information Updating (TIU). With the proposed framework, we achieve state-of-the-art performance on three UVOS benchmark sets: DAVIS16, FBMS, and SegTrack-V2.

【4】 A Random Point Initialization Approach to Image Segmentation with Variational Level-sets
Link: https://arxiv.org/abs/2112.12355

Authors: J. N. Mueller, J. N. Corcoran
Affiliations: Department of Applied Mathematics, University of Colorado Boulder
Comments: 17 pages, 27 figures
Abstract: Image segmentation is an essential component in many image processing and computer vision tasks. The primary goal of image segmentation is to simplify an image for easier analysis, and there are two broad approaches for achieving this: edge-based methods, which extract the boundaries of specific known objects, and region-based methods, which partition the image into regions that are statistically homogeneous. One of the more prominent edge-finding methods, known as the level set method, evolves a zero-level contour in the image plane with gradient descent until the contour has converged to the object boundaries. While the classical level set method and its variants have proved successful in segmenting real images, they are susceptible to becoming stuck in noisy regions of the image plane without a priori knowledge of the image, and they are unable to provide details beyond the locations of object outer boundaries. We propose a modification to the variational level set image segmentation method that can quickly detect object boundaries by making use of random point initialization. We demonstrate the efficacy of our approach by comparing the performance of our method on real images to that of the prominent Canny method.

【5】 Maximum Entropy on Erroneous Predictions (MEEP): Improving model calibration for medical image segmentation
Link: https://arxiv.org/abs/2112.12218

Authors: Agostina Larrazabal, Cesar Martinez, Jose Dolz, Enzo Ferrante
Affiliations: CONICET, Universidad Nacional del Litoral, Argentina; ETS Montreal, Canada
Abstract: Modern deep neural networks have achieved remarkable progress in medical image segmentation tasks. However, it has recently been observed that they tend to produce overconfident estimates, even in situations of high uncertainty, leading to poorly calibrated and unreliable models. In this work we introduce Maximum Entropy on Erroneous Predictions (MEEP), a training strategy for segmentation networks that selectively penalizes overconfident predictions, focusing only on misclassified pixels. In particular, we design a regularization term that encourages high-entropy posteriors for wrong predictions, increasing the network uncertainty in complex scenarios. Our method is agnostic to the neural architecture, does not increase model complexity, and can be coupled with multiple segmentation loss functions. We benchmark the proposed strategy on two challenging medical image segmentation tasks: white matter hyperintensity lesions in magnetic resonance images (MRI) of the brain, and atrial segmentation in cardiac MRI. The experimental results demonstrate that coupling MEEP with standard segmentation losses leads to improvements not only in model calibration but also in segmentation quality.
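The regularizer described in this abstract can be illustrated with a small sketch. This is an assumption-laden toy, not the authors' code: pixels are given as lists of class probabilities, the penalty is the negative mean Shannon entropy over misclassified pixels, and the name `meep_style_penalty` is hypothetical. Adding this term to a segmentation loss would reward uncertainty only where the prediction is wrong.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def meep_style_penalty(pixel_probs, labels):
    """Negative mean entropy over misclassified pixels.

    Minimizing this value pushes wrong predictions toward high entropy.
    Correctly classified pixels are ignored entirely.
    """
    wrong = [p for p, y in zip(pixel_probs, labels)
             if max(range(len(p)), key=p.__getitem__) != y]
    if not wrong:
        return 0.0
    return -sum(entropy(p) for p in wrong) / len(wrong)


# Three pixels, two classes: the first is correct, the other two are wrong.
probs = [[0.9, 0.1], [0.9, 0.1], [0.5, 0.5]]
labels = [0, 1, 1]
penalty = meep_style_penalty(probs, labels)
```

A confidently wrong pixel contributes a value close to zero, while an uncertain wrong pixel contributes a more negative value, so the optimizer is nudged toward the latter.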

【6】 Omni-Seg: A Single Dynamic Network for Multi-label Renal Pathology Image Segmentation using Partially Labeled Data
Link: https://arxiv.org/abs/2112.12665

Authors: Ruining Deng, Quan Liu, Can Cui, Zuhayr Asad, Haichun Yang, Yuankai Huo
Affiliations: Vanderbilt University, Department of Computer Science, Nashville, TN, USA; Vanderbilt University Medical Center, Department of Pathology, Nashville, TN, USA
Comments: Under review for MIDL
Abstract: Computer-assisted quantitative analysis of giga-pixel pathology images has provided a new avenue in precision medicine. Innovations have largely focused on cancer pathology (i.e., tumor segmentation and characterization). In non-cancer pathology, learning algorithms can be asked to examine more comprehensive tissue types simultaneously, in a multi-label setting. Prior work typically needed to train multiple segmentation networks in order to match the domain-specific knowledge for heterogeneous tissue types (e.g., glomerular tuft, glomerular unit, proximal tubular, distal tubular, peritubular capillaries, and arteries). In this paper, we propose a dynamic single segmentation network (Omni-Seg) that learns to segment multiple tissue types using partially labeled images (i.e., only one tissue type is labeled for each training image) for renal pathology. By learning from ~150,000 patch-wise pathological images from six tissue types, the proposed Omni-Seg network achieved superior segmentation accuracy and lower resource consumption compared to previous multiple-network and multi-head designs. In the testing stage, the proposed method obtains "completely labeled" tissue segmentation results using only "partially labeled" training images. The source code is available at https://github.com/ddrrnn123/Omni-Seg.

Zero/Few-Shot | Transfer | Domain Adaptation | Adaptive Methods (6 papers)

【1】 Boosting Generative Zero-Shot Learning by Synthesizing Diverse Features with Attribute Augmentation
Link: https://arxiv.org/abs/2112.12573

Authors: Xiaojie Zhao, Yuming Shen, Shidong Wang, Haofeng Zhang
Affiliations: School of Computer Science and Engineering, Nanjing University of Science and Technology, China; Department of Engineering Science, University of Oxford, UK; School of Engineering, Newcastle University, UK
Comments: Accepted by AAAI 2022
Abstract: Recent advances in deep generative models outline a promising perspective in the realm of Zero-Shot Learning (ZSL). Most generative ZSL methods use category semantic attributes plus Gaussian noise to generate visual features. After generating unseen samples, this family of approaches effectively transforms the ZSL problem into a supervised classification scheme. However, existing models use a single semantic attribute, which contains the complete attribute information of the category. The generated data also carry the complete attribute information, but in reality, visual samples usually have limited attributes. Therefore, the data generated from attributes could have incomplete semantics. Based on this fact, we propose a novel framework to boost ZSL by synthesizing diverse features. This method uses augmented semantic attributes to train the generative model, so as to simulate the real distribution of visual features. We evaluate the proposed model on four benchmark datasets and observe significant performance improvements over the state of the art.

【2】 Pose Adaptive Dual Mixup for Few-Shot Single-View 3D Reconstruction
Link: https://arxiv.org/abs/2112.12484

Authors: Ta-Ying Cheng, Hsuan-Ru Yang, Niki Trigoni, Hwann-Tzong Chen, Tyng-Luh Liu
Affiliations: Institute of Information Science, Academia Sinica, Taiwan; Department of Computer Science, National Tsing Hua University, Taiwan; Department of Computer Science, University of Oxford, UK
Comments: To appear in the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI), February 2022
Abstract: We present a pose adaptive few-shot learning procedure and a two-stage data interpolation regularization, termed Pose Adaptive Dual Mixup (PADMix), for single-image 3D reconstruction. While augmentations via interpolating feature-label pairs are effective in classification tasks, they fall short in shape prediction, potentially due to inconsistencies between the interpolated products of two images and of their volumes when rendering viewpoints are unknown. PADMix targets this issue with two sets of mixup procedures performed sequentially. We first perform an input mixup which, combined with a pose adaptive learning procedure, helps in learning 2D feature extraction and pose adaptive latent encoding. The stagewise training allows us to build upon the pose-invariant representations to perform a follow-up latent mixup under one-to-one correspondences between features and ground-truth volumes. PADMix significantly outperforms previous literature in few-shot settings on the ShapeNet dataset and sets new benchmarks on the more challenging real-world Pix3D dataset.
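As a hedged illustration of the interpolation that mixup-style augmentation relies on, the sketch below blends two (image, volume) training pairs with one shared coefficient. It is a generic mixup on flat lists, not PADMix itself: the Beta-distributed coefficient and the names `mixup` and `mixup_pair` are assumptions for illustration, and the pose adaptive procedure from the abstract is not modeled.

```python
import random


def mixup(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b of two equal-length vectors."""
    return [lam * u + (1.0 - lam) * v for u, v in zip(a, b)]


def mixup_pair(img_a, vol_a, img_b, vol_b, alpha=1.0, rng=random):
    """Blend two (image, volume) pairs with a single shared coefficient.

    Assumption: lam is drawn from Beta(alpha, alpha), as is common for mixup.
    """
    lam = rng.betavariate(alpha, alpha)
    return mixup(img_a, img_b, lam), mixup(vol_a, vol_b, lam), lam


# Toy example: 2-pixel "images" and 2-voxel "volumes".
mixed_img, mixed_vol, lam = mixup_pair([1.0, 0.0], [1.0, 1.0],
                                       [0.0, 1.0], [0.0, 0.0])
```

The same `mixup` function could in principle be applied to latent codes instead of raw inputs, which is the distinction the abstract draws between its input and latent stages.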

【3】 Adaptive Modeling Against Adversarial Attacks
Link: https://arxiv.org/abs/2112.12431

Authors: Zhiwen Yan, Teck Khim Ng
Affiliations: School of Computing, University of Singapore
Comments: 10 pages, 3 figures
Abstract: Adversarial training, the process of training a deep learning model with adversarial data, is one of the most successful adversarial defense methods for deep learning models. We have found that the robustness of an adversarially trained model to white-box attacks can be further improved if we fine-tune the model at the inference stage to adapt to the adversarial input, using the extra information it carries. We introduce an algorithm that "post trains" the model at the inference stage between the original output class and a "neighbor" class, using existing training data. With our algorithm, the accuracy of a pre-trained Fast-FGSM CIFAR10 classifier base model against the white-box projected gradient attack (PGD) can be significantly improved from 46.8% to 64.5%.

【4】 A Practical Data-Free Approach to One-shot Federated Learning with Heterogeneity
Link: https://arxiv.org/abs/2112.12371

Authors: Jie Zhang, Chen Chen, Bo Li, Lingjuan Lyu, Shuang Wu, Jianghe Xu, Shouhong Ding, Chao Wu
Affiliations: Zhejiang University; Tencent Youtu Lab; Sony AI
Abstract: One-shot Federated Learning (FL) has recently emerged as a promising approach that allows the central server to learn a model in a single communication round. Despite the low communication cost, existing one-shot FL methods are mostly impractical or face inherent limitations, e.g., a public dataset is required, clients' models must be homogeneous, or additional data or model information must be uploaded. To overcome these issues, we propose a more practical data-free approach named FedSyn for the one-shot FL framework with heterogeneity. FedSyn trains the global model through a data generation stage and a model distillation stage. To the best of our knowledge, FedSyn is the first method that can be practically applied to various real-world applications, due to the following advantages: (1) FedSyn requires no additional information (except the model parameters) to be transferred between clients and the server; (2) FedSyn does not require any auxiliary dataset for training; (3) FedSyn is the first to consider both model and statistical heterogeneities in FL, i.e., the clients' data are non-iid and different clients may have different model architectures. Experiments on a variety of real-world datasets demonstrate the superiority of FedSyn. For example, FedSyn outperforms the best baseline method, Fed-ADI, by 5.08% on the CIFAR10 dataset when data are non-iid.

【5】 More is Better: A Novel Multi-view Framework for Domain Generalization
Link: https://arxiv.org/abs/2112.12329

Authors: Jian Zhang, Lei Qi, Yinghuan Shi, Yang Gao
Affiliations: Nanjing University; Southeast University
Abstract: Aiming to generalize a model trained on source domains to unseen target domains, domain generalization (DG) has attracted a lot of attention recently. The key issue in DG is how to prevent overfitting to the observed source domains, because the target domain is unavailable during training. We find that overfitting not only degrades generalization to unseen target domains but also leads to unstable predictions in the test stage. In this paper, we observe that both sampling multiple tasks in the training stage and generating augmented images in the test stage largely benefit generalization performance. Thus, by treating tasks and images as different views, we propose a novel multi-view DG framework. Specifically, in the training stage, to enhance generalization ability, we develop a multi-view regularized meta-learning algorithm that employs multiple tasks to produce a suitable optimization direction when updating the model. In the test stage, to alleviate unstable predictions, we use multiple augmented images to yield a multi-view prediction, which significantly improves model reliability by fusing the results from different views of a test image. Extensive experiments on three benchmark datasets show that our method outperforms several state-of-the-art approaches.
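The test-stage idea in this abstract, fusing predictions from several augmented views of one image, can be sketched as a simple average of class probabilities. This is a generic illustration rather than the authors' method: the averaging rule and the names `fuse_views` and `predict_label` are assumptions, and the per-view probabilities are made up.

```python
def fuse_views(view_probs):
    """Average class-probability vectors predicted for several views."""
    n = len(view_probs)
    num_classes = len(view_probs[0])
    return [sum(v[c] for v in view_probs) / n for c in range(num_classes)]


def predict_label(view_probs):
    """Return the class index with the highest fused probability."""
    fused = fuse_views(view_probs)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical outputs for three augmented views of one test image.
# A single view (the first) would pick class 0; the fused result picks class 1.
views = [[0.55, 0.45], [0.30, 0.70], [0.40, 0.60]]
label = predict_label(views)
```

In this toy case one view disagrees with the others, which is the kind of unstable single-view prediction that fusion is meant to smooth out.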

【6】 InDuDoNet+: A Model-Driven Interpretable Dual Domain Network for Metal Artifact Reduction in CT Images
Link: https://arxiv.org/abs/2112.12660

Authors: Hong Wang, Yuexiang Li, Haimiao Zhang, Deyu Meng, Yefeng Zheng
Affiliations: Xi’an Jiaotong University, Xi’an, China; Tencent Jarvis Lab, Shenzhen, China; Beijing Information Science and Technology University, Beijing, China
Abstract: During the computed tomography (CT) imaging process, metallic implants within patients always cause harmful artifacts, which degrade the visual quality of reconstructed CT images and negatively affect the subsequent clinical diagnosis. For the metal artifact reduction (MAR) task, current deep learning based methods have achieved promising performance. However, most of them share two main limitations: 1) the CT physical imaging geometry constraint is not comprehensively incorporated into the deep network structures; 2) the entire framework has weak interpretability for the specific MAR task, so the role of each network module is difficult to evaluate. To alleviate these issues, in this paper we construct a novel interpretable dual domain network, termed InDuDoNet+, into which the CT imaging process is finely embedded. Concretely, we derive a joint spatial and Radon domain reconstruction model and propose an optimization algorithm with only simple operators for solving it. By unfolding the iterative steps of the proposed algorithm into the corresponding network modules, we easily build InDuDoNet+ with clear interpretability. Furthermore, we analyze the CT values among different tissues and merge the prior observations into a prior network for InDuDoNet+, which significantly improves its generalization performance. Comprehensive experiments on synthesized data and clinical data substantiate the superiority of the proposed methods as well as their superior generalization performance beyond the current state-of-the-art (SOTA) MAR methods. Code is available at https://github.com/hongwang01/InDuDoNet_plus.

Semi-/Weakly-/Unsupervised | Active Learning | Uncertainty (4 papers)

【1】 SLIP: Self-supervision meets Language-Image Pre-training
Link: https://arxiv.org/abs/2112.12750

Authors: Norman Mu, Alexander Kirillov, David Wagner, Saining Xie
Affiliations: UC Berkeley; Facebook AI Research (FAIR)
Abstract: Recent work has shown that self-supervised pre-training leads to improvements over supervised learning on challenging visual recognition tasks. CLIP, an exciting new approach to learning with language supervision, demonstrates promising performance on a wide variety of benchmarks. In this work, we explore whether self-supervised learning can aid in the use of language supervision for visual representation learning. We introduce SLIP, a multi-task learning framework for combining self-supervised learning and CLIP pre-training. After pre-training with Vision Transformers, we thoroughly evaluate representation quality and compare performance to both CLIP and self-supervised learning under three distinct settings: zero-shot transfer, linear classification, and end-to-end finetuning. Across ImageNet and a battery of additional datasets, we find that SLIP improves accuracy by a large margin. We validate our results further with experiments on different model sizes, training schedules, and pre-training datasets. Our findings show that SLIP enjoys the best of both worlds: better performance than self-supervision (+8.1% linear accuracy) and language supervision (+5.2% zero-shot accuracy).
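One way to picture the multi-task objective described in this abstract is as a sum of an image-text contrastive loss and a self-supervised contrastive loss between two augmented views. The sketch below is a toy under stated assumptions, not the released SLIP code: embeddings are plain lists, both terms use the same InfoNCE form, and `info_nce`, `clip_loss`, `slip_style_loss`, and `ssl_scale` are hypothetical names.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(queries, keys, temperature=0.1):
    """Mean cross-entropy in which query i should match key i."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)


def clip_loss(img, txt, temperature=0.1):
    """Symmetric image-to-text and text-to-image contrastive loss."""
    return 0.5 * (info_nce(img, txt, temperature)
                  + info_nce(txt, img, temperature))


def slip_style_loss(img, txt, view_a, view_b, ssl_scale=1.0):
    """Language-supervised term plus a scaled self-supervised term."""
    return clip_loss(img, txt) + ssl_scale * info_nce(view_a, view_b)


# Toy unit-norm embeddings for a batch of two image-text pairs.
emb = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_loss(emb, emb)
shuffled = clip_loss(emb, [emb[1], emb[0]])
```

With matched pairs the loss is close to zero, and swapping the captions makes it large, which is the behavior a contrastive objective is expected to show.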

【2】 Digital Editions as Distant Supervision for Layout Analysis of Printed Books
Link: https://arxiv.org/abs/2112.12703

Authors: Alejandro H. Toselli, Si Wu, David A. Smith
Affiliations: Khoury College of Computer Sciences, Northeastern University, Boston, MA, U.S.A.
Comments: 15 pages, 2 figures. International Conference on Document Analysis and Recognition. Springer, Cham, 2021
Abstract: Archivists, textual scholars, and historians often produce digital editions of historical documents. Using markup schemes such as those of the Text Encoding Initiative and EpiDoc, these digital editions often record documents' semantic regions (such as notes and figures) and physical features (such as page and line breaks) as well as transcribing their textual content. We describe methods for exploiting this semantic markup as distant supervision for training and evaluating layout analysis models. In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics. We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.

【3】 Learning Hierarchical Attention for Weakly-supervised Chest X-Ray Abnormality Localization and Diagnosis
Link: https://arxiv.org/abs/2112.12349

Authors: Xi Ouyang, Srikrishna Karanam, Ziyan Wu, Terrence Chen, Jiayu Huo, Xiang Sean Zhou, Qian Wang, Jie-Zhi Cheng
Affiliations: Institute for Medical Imaging Technology, School of Biomedical Engineering, Shanghai Jiao Tong University (Jiayu Huo, Qian Wang)
Abstract: We consider the problem of abnormality localization for clinical applications. While deep learning has driven much recent progress in medical imaging, many clinical challenges are not fully addressed, limiting its broader usage. While recent methods report high diagnostic accuracies, physicians have concerns about trusting these algorithm results for diagnostic decision-making because of a general lack of algorithm decision reasoning and interpretability. One potential way to address this problem is to further train these models to localize abnormalities in addition to classifying them. However, doing this accurately requires a large amount of disease localization annotations by clinical experts, a task that is prohibitively expensive for most applications. In this work, we take a step towards addressing these issues by means of a new attention-driven weakly supervised algorithm comprising a hierarchical attention mining framework that unifies activation- and gradient-based visual attention in a holistic manner. Our key algorithmic innovations include the design of explicit ordinal attention constraints, enabling principled model training in a weakly-supervised fashion, while also facilitating the generation of visual-attention-driven model explanations by means of localization cues. On two large-scale chest X-ray datasets (NIH ChestX-ray14 and CheXpert), we demonstrate significant localization performance improvements over the current state of the art while also achieving competitive classification performance. Our code is available at https://github.com/oyxhust/HAM.

【4】 Fine-grained Multi-Modal Self-Supervised Learning
Link: https://arxiv.org/abs/2112.12182

Authors: Duo Wang, Salah Karout
Affiliations: Department of Computer Science and Technology, University of Cambridge, Cambridge, UK; Huawei R&D UK, Cambridge Science Park
Comments: Accepted at BMVC 2021
摘要：视频多模式自监督学习已被证明可以提高模型在各种下游任务上的性能。然而，由于未处理数据中存在噪声，这种自监督预训练需要大量批量和大量计算资源。这部分是由于流行的训练方案是在粗粒度设置上训练的，其中表示整个视频片段或自然语言句子的向量用于计算相似度。由于视频片段的一部分和其他模态输入（如文本描述）完全不相关，这种方案使得训练变得有噪声。在本文中，我们提出了一种细粒度多模态自监督训练方案，该方案在更精细的尺度上计算嵌入之间的相似性（例如单个特征映射嵌入和短语嵌入），并使用注意机制来减少噪声对在损失函数中的权重。我们表明，通过所提出的预训练方案，我们可以训练更小的模型，更小的批量和更少的计算资源，以实现与最新技术相当的下游任务性能，包括动作识别和文本图像检索任务。
摘要：Multi-Modal Self-Supervised Learning from videos has been shown to improve model's performance on various downstream tasks. However, such Self-Supervised pre-training requires large batch sizes and a large amount of computation resources due to the noise present in the uncurated data. This is partly due to the fact that the prevalent training scheme is trained on coarse-grained setting, in which vectors representing the whole video clips or natural language sentences are used for computing similarity. Such scheme makes training noisy as part of the video clips can be totally not correlated with the other-modality input such as text description. In this paper, we propose a fine-grained multi-modal self-supervised training scheme that computes the similarity between embeddings at finer-scale (such as individual feature map embeddings and embeddings of phrases), and uses attention mechanisms to reduce noisy pairs' weighting in the loss function. We show that with the proposed pre-training scheme, we can train smaller models, with smaller batch-size and much less computational resources to achieve downstream tasks performances comparable to State-Of-The-Art, for tasks including action recognition and text-image retrievals.
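
The fine-grained similarity described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the cosine similarity, and the softmax temperature are all assumptions chosen for illustration. Each phrase attends over the clip's feature-map embeddings, and phrases that match poorly receive a low weight in the final score.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]

def softmax(xs, temperature=1.0):
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fine_grained_similarity(region_embs, phrase_embs, temperature=0.1):
    """Attention-weighted similarity between one clip and one sentence.

    region_embs: feature-map embeddings of the clip (hypothetical input)
    phrase_embs: phrase embeddings of the sentence (hypothetical input)
    """
    regions = [normalize(r) for r in region_embs]
    phrases = [normalize(p) for p in phrase_embs]
    per_phrase = []
    for p in phrases:
        sims = [dot(p, r) for r in regions]
        # Attend to the regions that best match this phrase.
        weights = softmax(sims, temperature)
        per_phrase.append(sum(w * s for w, s in zip(weights, sims)))
    # Down-weight phrases that have no good match in the clip (noisy pairs).
    phrase_weights = softmax(per_phrase, temperature)
    return sum(w * s for w, s in zip(phrase_weights, per_phrase))
```

A coarse-grained scheme would instead pool all regions and all phrases into one vector each before comparing them, so a single irrelevant region would dilute the score.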

Temporal | Action Recognition | Pose | Video | Motion Estimation (1 paper)

【1】 BANMo: Building Animatable 3D Neural Models from Many Casual Videos
Link: https://arxiv.org/abs/2112.12761

Authors: Gengshan Yang, Minh Vo, Natalia Neverova, Deva Ramanan, Andrea Vedaldi, Hanbyul Joo
Affiliations: Meta AI, Carnegie Mellon University, Meta Reality Labs
Abstract: Prior work on articulated 3D shape reconstruction often relies on specialized sensors (e.g., synchronized multi-camera systems) or pre-built 3D deformable models (e.g., SMAL or SMPL). Such methods are not able to scale to diverse sets of objects in the wild. We present BANMo, a method that requires neither a specialized sensor nor a pre-defined template shape. BANMo builds high-fidelity, articulated 3D models (including shape and animatable skinning weights) from many monocular casual videos in a differentiable rendering framework. While the use of many videos provides more coverage of camera views and object articulations, it introduces significant challenges in establishing correspondence across scenes with different backgrounds, illumination conditions, etc. Our key insight is to merge three schools of thought: (1) classic deformable shape models that make use of articulated bones and blend skinning, (2) volumetric neural radiance fields (NeRFs) that are amenable to gradient-based optimization, and (3) canonical embeddings that generate correspondences between pixels and an articulated model. We introduce neural blend skinning models that allow for differentiable and invertible articulated deformations. When combined with canonical embeddings, such models allow us to establish dense correspondences across videos that can be self-supervised with cycle consistency. On real and synthetic datasets, BANMo shows higher-fidelity 3D reconstructions than prior works for humans and animals, with the ability to render realistic images from novel viewpoints and poses. Project webpage: banmo-www.github.io.
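
For readers unfamiliar with blend skinning, the classic (non-neural) form can be sketched in a few lines. This is a 2D illustration under assumed names, not BANMo's neural blend skinning, which learns the skinning weights and supports inversion: a point is moved by every bone's rigid transform, and the results are averaged with the point's skinning weights.

```python
import math

def rigid_transform(point, angle, translation):
    """Rotate a 2D point by `angle` (radians), then translate it."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y + translation[0], s * x + c * y + translation[1])

def blend_skinning(point, bones, weights):
    """Linear blend skinning for one point.

    bones:   list of (angle, (tx, ty)) rigid transforms, one per bone
    weights: skinning weights of this point, assumed to sum to 1
    """
    out_x = out_y = 0.0
    for (angle, translation), w in zip(bones, weights):
        bx, by = rigid_transform(point, angle, translation)
        out_x += w * bx
        out_y += w * by
    return (out_x, out_y)
```

A point fully bound to one bone (weight 1) simply follows that bone; a point near a joint, with mixed weights, lands between the positions the individual bones would give it.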

Medical (2 papers)

【1】 INTRPRT: A Systematic Review of and Guidelines for Designing and Validating Transparent AI in Medical Image Analysis
Link: https://arxiv.org/abs/2112.12596

Authors: Haomin Chen, Catalina Gomez, Chien-Ming Huang, Mathias Unberath
Affiliations: Department of Computer Science, Johns Hopkins University
Abstract: Transparency in Machine Learning (ML) attempts to reveal the working mechanisms of complex models. Transparent ML promises to advance human factors engineering goals of human-centered AI in the target users. From a human-centered design perspective, transparency is not a property of the ML model but an affordance, i.e., a relationship between algorithm and user; as a result, iterative prototyping and evaluation with users is critical to attaining adequate solutions that afford transparency. However, following human-centered design principles in healthcare and medical image analysis is challenging due to the limited availability of and access to end users. To investigate the state of transparent ML in medical image analysis, we conducted a systematic review of the literature. Our review reveals multiple severe shortcomings in the design and validation of transparent ML for medical image analysis applications. We find that most studies to date approach transparency as a property of the model itself, similar to task performance, without considering end users during either development or evaluation. Additionally, the lack of user research and the sporadic validation of transparency claims put contemporary research on transparent ML for medical image analysis at risk of being incomprehensible to users, and thus clinically irrelevant. To alleviate these shortcomings in forthcoming research while acknowledging the challenges of human-centered design in healthcare, we introduce the INTRPRT guideline, a systematic design directive for transparent ML systems in medical image analysis. The INTRPRT guideline suggests formative user research as the first step of transparent model design to understand user needs and domain requirements. Following this process produces evidence to support design choices and ultimately increases the likelihood that the algorithms afford transparency.

【2】 AI-based Reconstruction for Fast MRI -- A Systematic Review and Meta-analysis
Link: https://arxiv.org/abs/2112.12744

Authors: Yutong Chen, Carola-Bibiane Schönlieb, Pietro Liò, Tim Leiner, Pier Luigi Dragotti, Ge Wang, Daniel Rueckert, David Firmin, Guang Yang
Affiliations: National Heart & Lung Institute, Imperial College London, London, U.K.; Cardiovascular Research Centre, Royal Brompton Hospital, London, U.K.; University of Cambridge, Cambridge, U.K.
Comments: 42 pages, 5 figures, Proceedings of the IEEE
Abstract: Compressed sensing (CS) has been playing a key role in accelerating the magnetic resonance imaging (MRI) acquisition process. With the resurgence of artificial intelligence, deep neural networks and CS algorithms are being integrated to redefine the state of the art of fast MRI. The past several years have witnessed substantial growth in the complexity, diversity, and performance of deep learning-based CS techniques that are dedicated to fast MRI. In this meta-analysis, we systematically review the deep learning-based CS techniques for fast MRI, describe key model designs, highlight breakthroughs, and discuss promising directions. We also introduce a comprehensive analysis framework and a classification system to assess the pivotal role of deep learning in CS-based acceleration for MRI.

GAN | Adversarial | Attack | Generation (3 papers)

【1】 NinjaDesc: Content-Concealing Visual Descriptors via Adversarial Learning
Link: https://arxiv.org/abs/2112.12785

Authors: Tony Ng, Hyo Jin Kim, Vincent Lee, Daniel Detone, Tsun-Yi Yang, Tianwei Shen, Eddy Ilg, Vassileios Balntas, Krystian Mikolajczyk, Chris Sweeney
Affiliations: Reality Labs, Meta; Imperial College London
Abstract: In light of recent analyses of privacy-concerning scene revelation from visual descriptors, we develop descriptors that conceal the input image content. In particular, we propose an adversarial learning framework for training visual descriptors that prevent image reconstruction while maintaining matching accuracy. We let a feature encoding network and an image reconstruction network compete with each other, such that the feature encoder tries to impede image reconstruction with its generated descriptors, while the reconstructor tries to recover the input image from the descriptors. The experimental results demonstrate that the visual descriptors obtained with our method significantly deteriorate the image reconstruction quality with minimal impact on correspondence matching and camera localization performance.
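
The competition between encoder and reconstructor can be written as two opposing objectives. The sketch below is only an assumed form of such a min-max setup (mean squared reconstruction error, a scalar `privacy_weight`, and the function names are all hypothetical); the abstract does not specify the actual losses.

```python
def mse(image, reconstruction):
    """Mean squared error between two flat lists of pixel values."""
    return sum((a - b) ** 2 for a, b in zip(image, reconstruction)) / len(image)

def reconstructor_loss(image, reconstruction):
    # The reconstructor wants its output to match the input image.
    return mse(image, reconstruction)

def encoder_loss(matching_loss, image, reconstruction, privacy_weight=1.0):
    # The encoder keeps descriptors useful for matching (low matching_loss)
    # while being rewarded when the reconstruction is poor (high error).
    return matching_loss - privacy_weight * mse(image, reconstruction)
```

In training, the two losses would be minimized in alternation, each with respect to its own network's parameters; `privacy_weight` trades concealment against matching utility.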

【2】 Comparison and Analysis of Image-to-Image Generative Adversarial Networks: A Survey
Link: https://arxiv.org/abs/2112.12625

Authors: Sagar Saxena, Mohammad Nayeem Teli
Comments: 22 pages, 22 figures, Preprint, Under review at IJCV
Abstract: Generative Adversarial Networks (GANs) have recently introduced effective methods of performing image-to-image translation. These models can be applied and generalized to a variety of domains in image-to-image translation without changing any parameters. In this paper, we survey and analyze eight image-to-image Generative Adversarial Networks: Pix2Pix, CycleGAN, CoGAN, StarGAN, MUNIT, StarGAN2, DA-GAN, and Self Attention GAN. Each of these models presented state-of-the-art results and introduced new techniques to build image-to-image GANs. In addition to a survey of the models, we also survey the 18 datasets they were trained on and the 9 metrics they were evaluated on. Finally, we present results of a controlled experiment for 6 of these models on a common set of metrics and datasets. The results were mixed and showed that on certain datasets, tasks, and metrics some models outperformed others. The last section of this paper discusses those results and establishes areas of future research. As researchers continue to develop new image-to-image GANs, it is important that they gain a good understanding of the existing methods, datasets, and metrics. This paper provides a comprehensive overview and discussion to help build this foundation.

【3】 Manifold Learning Benefits GANs
Link: https://arxiv.org/abs/2112.12618

Authors: Yao Ni, Piotr Koniusz, Richard Hartley, Richard Nock
Affiliations: The Australian National University, Data61/CSIRO, Google Research
Comments: 30 pages, full version
Abstract: In this paper, we improve Generative Adversarial Networks by incorporating a manifold learning step into the discriminator. We consider locality-constrained linear and subspace-based manifolds, and locality-constrained non-linear manifolds. In our design, the manifold learning and coding steps are intertwined with layers of the discriminator, with the goal of attracting intermediate feature representations onto manifolds. We adaptively balance the discrepancy between feature representations and their manifold view, which represents a trade-off between denoising on the manifold and refining the manifold. We conclude that locality-constrained non-linear manifolds have the upper hand over linear manifolds due to their non-uniform density and smoothness. We show substantial improvements over different recent state-of-the-art baselines.

Autonomous Driving | Vehicles | Lane Detection (1 paper)

【1】 PandaSet: Advanced Sensor Suite Dataset for Autonomous Driving
Link: https://arxiv.org/abs/2112.12610

Authors: Pengchuan Xiao, Zhenlei Shao, Steven Hao, Zishuo Zhang, Xiaolin Chai, Judy Jiao, Zesong Li, Jian Wu, Kai Sun, Kun Jiang, Yunlong Wang, Diange Yang
Affiliations: School of Vehicle and Mobility, Tsinghua University
Comments: Published at ITSC 2021; see the PandaSet website for more information
Abstract: The accelerating development of autonomous driving technology has placed greater demands on obtaining large amounts of high-quality data. Representative, labeled, real-world data serves as the fuel for training deep learning networks and is critical for improving self-driving perception algorithms. In this paper, we introduce PandaSet, the first dataset produced by a complete, high-precision autonomous vehicle sensor kit with a no-cost commercial license. The dataset was collected using one 360° mechanical spinning LiDAR, one forward-facing, long-range LiDAR, and 6 cameras. The dataset contains more than 100 scenes, each of which is 8 seconds long, and provides 28 types of labels for object classification and 37 types of labels for semantic segmentation. We provide baselines for LiDAR-only 3D object detection, LiDAR-camera fusion 3D object detection, and LiDAR point cloud segmentation. For more details about PandaSet and the development kit, see https://scale.com/open-datasets/pandaset.

NAS | Model Search (1 paper)

【1】 Neuroevolution deep learning architecture search for estimation of river surface elevation from photogrammetric Digital Surface Models
Link: https://arxiv.org/abs/2112.12510

Authors: Radosław Szostak, Marcin Pietroń, Mirosław Zimnoch, Przemysław Wachniew, Paweł Ćwiąkała, Edyta Puniach
Affiliations: AGH UST
Comments: Extended version of NeurIPS 2021 Workshop paper (ML4PhysicalSciences)
Abstract: Developing new methods of surface water observation is crucial in view of increasingly frequent extreme hydrological events related to global warming and the increasing demand for water. Orthophotos and digital surface models (DSMs) obtained using UAV photogrammetry can be used to determine the Water Surface Elevation (WSE) of a river. However, this task is difficult due to disturbances of the water surface on DSMs caused by limitations of photogrammetric algorithms. In this study, machine learning was used to extract a WSE value from disturbed photogrammetric data. A brand new dataset has been prepared specifically for this purpose by hydrology and photogrammetry experts. The new method is an important step toward automating water surface level measurements with high spatial and temporal resolution. Such data can be used to validate and calibrate hydrological, hydraulic, and hydrodynamic models, making hydrological forecasts more accurate, in particular when predicting extreme and dangerous events such as floods or droughts. To our knowledge, this is the first approach in which a dataset was created for this purpose and deep learning models were used for this task. Additionally, a neuroevolution algorithm was set up to explore different architectures to find locally optimal models, and a non-gradient search was performed to fine-tune the model parameters. The achieved results have better accuracy than manual methods of determining WSE from photogrammetric DSMs.

Image/Video Retrieval | Re-ID (1 paper)

【1】 Cross Modal Retrieval with Querybank Normalisation
Link: https://arxiv.org/abs/2112.12777

Authors: Simion-Vlad Bogolin, Ioana Croitoru, Hailin Jin, Yang Liu, Samuel Albanie
Affiliations: Visual Geometry Group, Univ. of Oxford; Inst. of Mathematics of the Romanian Academy; Adobe Research; Wangxuan Inst. of Computer Technology, Peking Univ.; Dept. of Engineering, Univ. of Cambridge
Abstract: Profiting from large-scale training datasets, advances in neural architecture design, and efficient inference, joint embeddings have become the dominant approach for tackling cross-modal retrieval. In this work we first show that, despite their effectiveness, state-of-the-art joint embeddings suffer significantly from the longstanding hubness problem, in which a small number of gallery embeddings form the nearest neighbours of many queries. Drawing inspiration from the NLP literature, we formulate a simple but effective framework called Querybank Normalisation (QB-Norm) that re-normalises query similarities to account for hubs in the embedding space. QB-Norm improves retrieval performance without requiring retraining. In contrast to prior work, we show that QB-Norm works effectively without concurrent access to any test set queries. Within the QB-Norm framework, we also propose a novel similarity normalisation method, the Dynamic Inverted Softmax, that is significantly more robust than existing approaches. We showcase QB-Norm across a range of cross-modal retrieval models and benchmarks, where it consistently enhances strong baselines beyond the state of the art. Code is available at https://vladbogo.github.io/QB-Norm/.
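
To make the hubness correction concrete, here is a minimal sketch of an inverted-softmax style normalisation with a gating rule. The abstract does not give the formulas, so everything here is an assumption for illustration: the function names, the inverse temperature `beta`, and the rule that normalisation is applied only when the query's top gallery item is also the top item of some querybank query.

```python
import math

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def inverted_softmax(query_sims, bank_sims, beta=20.0):
    """Divide each gallery score by how strongly the querybank hits that item.

    query_sims: similarities of one test query to every gallery item
    bank_sims:  one row per querybank query, similarities to every gallery item
    """
    normalised = []
    for j, sim in enumerate(query_sims):
        denom = sum(math.exp(beta * row[j]) for row in bank_sims)
        normalised.append(math.exp(beta * sim) / denom)
    return normalised

def dynamic_inverted_softmax(query_sims, bank_sims, beta=20.0):
    """Apply the normalisation only if the query's top match looks like a hub."""
    likely_hubs = {argmax(row) for row in bank_sims}
    if argmax(query_sims) in likely_hubs:
        return inverted_softmax(query_sims, bank_sims, beta)
    return list(query_sims)
```

Gallery items that score highly for many querybank queries get a large denominator and are pushed down the ranking, while queries whose best match is not a suspected hub keep their original scores.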

Point Cloud | SLAM | Radar | LiDAR | Depth/RGB-D (1 paper)

【1】 NVS-MonoDepth: Improving Monocular Depth Prediction with Novel View Synthesis
Link: https://arxiv.org/abs/2112.12577

Authors: Zuria Bauer, Zuoyue Li, Sergio Orts-Escolano, Miguel Cazorla, Marc Pollefeys, Martin R. Oswald
Affiliations: University of Alicante, ETH Zurich, University of Amsterdam, Microsoft
Abstract: Building upon the recent progress in novel view synthesis, we propose its application to improve monocular depth estimation. In particular, we propose a novel training method split into three main steps. First, the prediction results of a monocular depth network are warped to an additional viewpoint. Second, we apply an additional image synthesis network, which corrects and improves the quality of the warped RGB image. The output of this network is required to look as similar as possible to the ground-truth view by minimizing the pixel-wise RGB reconstruction error. Third, we reapply the same monocular depth estimation to the synthesized second viewpoint and ensure that the depth predictions are consistent with the associated ground-truth depth. Experimental results show that our method achieves state-of-the-art or comparable performance on the KITTI and NYU-Depth-v2 datasets with a lightweight and simple vanilla U-Net architecture.

Other Neural Networks | Deep Learning | Models | Modeling (4 papers)

【1】 PyCIL: A Python Toolbox for Class-Incremental Learning
Link: https://arxiv.org/abs/2112.12533

Authors: Da-Wei Zhou, Fu-Yun Wang, Han-Jia Ye, De-Chuan Zhan
Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Comments: Technical report. Code is available at the URL given in the abstract.
Abstract: Traditional machine learning systems are deployed under the closed-world setting, which requires the entire training data before the offline training process. However, real-world applications often face incoming new classes, and a model should incorporate them continually. This learning paradigm is called Class-Incremental Learning (CIL). We propose a Python toolbox that implements several key algorithms for class-incremental learning to ease the burden of researchers in the machine learning community. The toolbox contains implementations of a number of founding works of CIL, such as EWC and iCaRL, and also provides current state-of-the-art algorithms that can be used for conducting novel fundamental research. This toolbox, named PyCIL for Python Class-Incremental Learning, is available at https://github.com/G-U-N/PyCIL
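
The class-incremental setting itself is easy to state in code. The helper below is hypothetical and is not part of the PyCIL API; it only shows how a label set is typically split into an initial phase followed by equally sized increments, with the model expected to classify over all classes seen so far.

```python
def make_phases(class_ids, initial, increment):
    """Split class ids into an initial phase plus fixed-size increments."""
    phases = [class_ids[:initial]]
    for start in range(initial, len(class_ids), increment):
        phases.append(class_ids[start:start + increment])
    return phases

def seen_classes(phases, phase_index):
    """All classes the model must recognise after finishing `phase_index`."""
    return [c for phase in phases[:phase_index + 1] for c in phase]
```

At each phase only the data of the newly introduced classes is fully available for training, while evaluation covers everything returned by `seen_classes`, which is what makes forgetting visible.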

【2】 DILF-EN framework for Class-Incremental Learning
Link: https://arxiv.org/abs/2112.12385

Authors: Mohammed Asad Karim, Indu Joshi, Pratik Mazumder, Pravendra Singh
Affiliations: Independent Researcher, India; Inria Sophia Antipolis, France; IIT Kanpur, India; IIT Roorkee, India
Comments: Under review
Abstract: Deep learning models suffer from catastrophic forgetting of the classes in the older phases as they get trained on the classes introduced in the new phase in the class-incremental learning setting. In this work, we show that the effect of catastrophic forgetting on the model prediction varies with the change in orientation of the same image, which is a novel finding. Based on this, we propose a novel data-ensemble approach that combines the predictions for the different orientations of the image to help the model retain further information regarding the previously seen classes and thereby reduce the effect of forgetting on the model predictions. However, we cannot directly use the data-ensemble approach if the model is trained using traditional techniques. Therefore, we also propose a novel dual-incremental learning framework that involves jointly training the network with two incremental learning objectives, i.e., the class-incremental learning objective and our proposed data-incremental learning objective. In the dual-incremental learning framework, each image belongs to two classes, i.e., the image class (for class-incremental learning) and the orientation class (for data-incremental learning). In class-incremental learning, each new phase introduces a new set of classes, and the model cannot access the complete training data from the older phases. In our proposed data-incremental learning, the orientation classes remain the same across all the phases, and the data introduced by the new phase in class-incremental learning acts as new training data for these orientation classes. We empirically demonstrate that the dual-incremental learning framework is vital to the data-ensemble approach. We apply our proposed approach to state-of-the-art class-incremental learning methods and empirically show that our framework significantly improves the performance of these methods.
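
The data-ensemble idea, combining predictions over several orientations of the same image, can be sketched as follows. This is an assumed simplification: the four right-angle rotations, plain averaging, and the `predict` callable (any function mapping an image to class probabilities) are illustrative choices, and the paper's joint training with orientation classes is not shown.

```python
def rotate90(image):
    """Rotate a 2D list-of-lists image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def orientation_ensemble(image, predict):
    """Average class probabilities over the 0/90/180/270 degree views.

    predict: hypothetical callable, image -> list of class probabilities
    """
    views = [image]
    for _ in range(3):
        views.append(rotate90(views[-1]))
    probs = [predict(view) for view in views]
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
```

If forgetting damages the prediction for one orientation more than for the others, averaging lets the less affected views pull the final decision back toward the correct class.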

【3】 Dual Path Structural Contrastive Embeddings for Learning Novel Objects
Link: https://arxiv.org/abs/2112.12359

Authors: Bingbin Li, Elvis Han Cui, Yanan Li, Donghui Wang, Weng Wong
Abstract: Learning novel classes from very few labeled samples has attracted increasing attention in machine learning. Recent research on both meta-learning-based and transfer-learning-based paradigms demonstrates that gaining information on a good feature space can be an effective solution for achieving favorable performance on few-shot tasks. In this paper, we propose a simple but effective paradigm that decouples the tasks of learning feature representations and classifiers, and only learns the feature embedding architecture from base classes via the typical transfer-learning training strategy. To maintain both the generalization ability across base and novel classes and the discrimination ability within each class, we propose a dual-path feature learning scheme that effectively combines structural similarity with contrastive feature construction. In this way, both inner-class alignment and inter-class uniformity can be well balanced, resulting in improved performance. Experiments on three popular benchmarks show that, when combined with a simple prototype-based classifier, our method can still achieve promising results for both standard and generalized few-shot problems in either an inductive or transductive inference setting.
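
The "simple prototype-based classifier" mentioned above is commonly realised as a nearest-class-mean rule on top of frozen embeddings. The sketch below assumes that form (mean prototypes, cosine similarity, and all names are illustrative, not taken from the paper).

```python
import math

def mean_vector(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def build_prototypes(support_set):
    """support_set: dict mapping label -> list of support embeddings."""
    return {label: mean_vector(embs) for label, embs in support_set.items()}

def classify(query, prototypes):
    """Return the label whose prototype is most similar to the query."""
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))
```

Because the classifier has no trainable parameters, few-shot accuracy depends entirely on how well the learned embedding space separates novel classes, which is the part the paper's dual-path scheme targets.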

【4】 Revisiting Transformation Invariant Geometric Deep Learning: Are Initial Representations All You Need?
Link: https://arxiv.org/abs/2112.12345

Authors: Ziwei Zhang, Xin Wang, Zeyang Zhang, Peng Cui, Wenwu Zhu
Affiliations: Tsinghua University, Beijing, China
Comments: 11 pages
Abstract: Geometric deep learning, i.e., designing neural networks to handle ubiquitous geometric data such as point clouds and graphs, has achieved great success in the last decade. One critical inductive bias is that the model can maintain invariance towards various transformations such as translation, rotation, and scaling. The existing graph neural network (GNN) approaches can only maintain permutation invariance, failing to guarantee invariance with respect to other transformations. Besides GNNs, other works design sophisticated transformation-invariant layers, which are computationally expensive and difficult to extend. To solve this problem, we revisit why the existing neural networks cannot maintain transformation invariance when handling geometric data. Our findings show that transformation-invariant and distance-preserving initial representations are sufficient to achieve transformation invariance, without needing sophisticated neural layer designs. Motivated by these findings, we propose Transformation Invariant Neural Networks (TinvNN), a straightforward and general framework for geometric data. Specifically, we realize transformation-invariant and distance-preserving initial point representations by modifying multi-dimensional scaling before feeding the representations into neural networks. We prove that TinvNN can strictly guarantee transformation invariance, while being general and flexible enough to be combined with the existing neural networks. Extensive experimental results on point cloud analysis and combinatorial optimization demonstrate the effectiveness and general applicability of our proposed method. Based on the experimental results, we advocate that TinvNN should be considered a new starting point and an essential baseline for further studies of transformation-invariant geometric deep learning.
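
The principle behind invariant initial representations can be checked numerically. The sketch below is not TinvNN and does not implement the paper's modified multi-dimensional scaling; it only shows, with hypothetical helper names, that quantities derived from pairwise distances are unchanged by rotation and translation, and that dividing by the largest distance also removes the effect of uniform scaling.

```python
import math

def rigid_transform(points, angle, shift):
    """Rotate 2D points by `angle` (radians) and translate them by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1]) for x, y in points]

def pairwise_distances(points):
    return [[math.dist(p, q) for q in points] for p in points]

def scale_free_distances(points):
    """Pairwise distances divided by the largest one (scale invariant)."""
    dists = pairwise_distances(points)
    largest = max(max(row) for row in dists) or 1.0
    return [[d / largest for d in row] for row in dists]
```

Any network fed with such distance-derived inputs inherits the invariance for free, which is the abstract's argument for moving the burden from layer design to the initial representation.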

Other (5 papers)

【1】 Towards Disturbance-Free Visual Mobile Manipulation
Link: https://arxiv.org/abs/2112.12612

Authors: Tianwei Ni, Kiana Ehsani, Luca Weihs, Jordi Salvador
Affiliations: Université de Montréal & Mila
Abstract: Embodied AI has shown promising results on an abundance of robotic tasks in simulation, including visual navigation and manipulation. Prior work generally pursues high success rates with shortest paths while largely ignoring the problems caused by collision during interaction. This lack of prioritization is understandable: in simulated environments there is no inherent cost to breaking virtual objects. As a result, well-trained agents frequently collide catastrophically with objects despite final success. In the robotics community, where the cost of collision is large, collision avoidance is a long-standing and crucial topic to ensure that robots can be safely deployed in the real world. In this work, we take the first step towards collision/disturbance-free embodied AI agents for visual mobile manipulation, facilitating safe deployment in real robots. We develop a new disturbance-avoidance methodology, at the heart of which is the auxiliary task of disturbance prediction. When combined with a disturbance penalty, our auxiliary task greatly enhances sample efficiency and final performance by distilling knowledge of disturbance into the agent. Our experiments on ManipulaTHOR show that, on testing scenes with novel objects, our method improves the success rate from 61.7% to 85.6% and the success rate without disturbance from 29.8% to 50.2% over the original baseline. Extensive ablation studies show the value of our pipelined approach. Project site is at https://sites.google.com/view/disturb-free
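
The two ingredients named in the abstract, a disturbance penalty and an auxiliary disturbance-prediction task, can be combined as in the sketch below. The exact formulation is not given in the abstract, so the linear penalty, the binary cross-entropy auxiliary loss, and the coefficients are assumptions for illustration.

```python
import math

def shaped_reward(task_reward, disturbance, penalty_coef=1.0):
    """Task reward minus a penalty proportional to the measured disturbance."""
    return task_reward - penalty_coef * disturbance

def disturbance_prediction_loss(predicted_prob, disturbed, eps=1e-7):
    """Binary cross-entropy for predicting whether an action disturbs objects."""
    p = min(max(predicted_prob, eps), 1.0 - eps)
    return -math.log(p) if disturbed else -math.log(1.0 - p)

def total_loss(policy_loss, predicted_prob, disturbed, aux_coef=0.1):
    """Policy loss plus the weighted auxiliary prediction loss."""
    return policy_loss + aux_coef * disturbance_prediction_loss(predicted_prob, disturbed)
```

The penalty alone gives a sparse learning signal; the auxiliary loss gives the agent a dense, supervised signal about which actions cause disturbance, which is consistent with the reported gain in sample efficiency.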

【2】 Attentive Multi-View Deep Subspace Clustering Net
Link: https://arxiv.org/abs/2112.12506

Authors: Run-kun Lu, Jian-wei Liu, Xin Zuo
Affiliations: Department of Automation, College of Information Science and Engineering, China University of Petroleum, Beijing, Mailbox, Changping District, Beijing
Abstract: In this paper, we propose novel Attentive Multi-View Deep Subspace Nets (AMVDSN), which deeply explore the underlying consistent and view-specific information from multiple views and fuse them by considering each view's dynamic contribution, obtained through an attention mechanism. Unlike most multi-view subspace learning methods, which directly reconstruct data points on raw data or consider only consistency or complementarity when learning representations in deep or shallow space, our proposed method seeks a joint latent representation that explicitly considers both consensus and view-specific information among multiple views, and then performs subspace clustering on the learned joint latent representation. Moreover, because different views contribute differently to representation learning, we introduce an attention mechanism to derive a dynamic weight for each view, which performs much better than previous fusion methods in the field of multi-view subspace clustering. The proposed algorithm is intuitive and, thanks to its neural network framework, can be easily optimized using only Stochastic Gradient Descent (SGD); the framework also provides strong non-linear characterization capability compared with traditional subspace clustering approaches. Experimental results on seven real-world data sets demonstrate the effectiveness of our proposed algorithm against some state-of-the-art subspace learning approaches.
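
The attention-based fusion described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy latent vectors, and the per-view scores are all hypothetical, and a softmax over view scores is assumed as the weighting scheme, which the abstract does not specify.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_views(view_latents, view_scores):
    """Fuse per-view latent vectors with attention weights.

    view_latents: equal-length vectors, one per view (hypothetical inputs).
    view_scores:  one unnormalized relevance score per view (hypothetical;
                  in a trained model a small network would produce these).
    """
    weights = softmax(view_scores)
    dim = len(view_latents[0])
    fused = [
        sum(w * latent[d] for w, latent in zip(weights, view_latents))
        for d in range(dim)
    ]
    return fused, weights

# Two views of one sample; the second view receives the higher score,
# so it contributes more to the fused representation.
latents = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = fuse_views(latents, view_scores=[0.0, 1.0])
```

Because the weights are recomputed from the scores for each input, the contribution of each view can change from sample to sample, which is what "dynamic weight" suggests here.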

【3】 DD-NeRF: Double-Diffusion Neural Radiance Field as a Generalizable Implicit Body Representation
Link: https://arxiv.org/abs/2112.12390

Authors: Guangming Yao, Hongzhi Wu, Yi Yuan, Kun Zhou
Affiliations: NetEase Fuxi AI Lab, State Key Lab of CAD&CG, Zhejiang University
Comments: 8 pages, 4 figures
Abstract: We present DD-NeRF, a novel generalizable implicit field for representing human body geometry and appearance from arbitrary input views. The core contribution is a double diffusion mechanism, which leverages a sparse convolutional neural network to build two volumes that represent a human body at different levels: a coarse body volume takes advantage of an unclothed deformable mesh to provide large-scale geometric guidance, and a detail feature volume learns the intricate geometry from local image features. We also employ a transformer network to aggregate image features and raw pixels across views to compute the final high-fidelity radiance field. Experiments on various datasets show that the proposed approach outperforms previous work in both geometry reconstruction and novel view synthesis quality.

【4】 Predição da Idade Cerebral a partir de Imagens de Ressonância Magnética utilizando Redes Neurais Convolucionais
Title (English translation): Brain Age Prediction from Magnetic Resonance Images Using Convolutional Neural Networks
Link: https://arxiv.org/abs/2112.12609

Authors: Victor H. R. Oliveira, Augusto Antunes, Alexandre S. Soares, Arthur D. Reys, Robson Z. Júnior, Saulo D. S. Pedro, Danilo Silva
Affiliations: Para a Alzheimer’s Disease Neuroimaging Initiative, e o Australian Imaging Biomarkers and Lifestyle flagship study of ageing, Universidade Federal de Santa Catarina, Florianópolis, SC, Grupo , Belo Horizonte, MG
Comments: 3 pages, 3 figures, in Portuguese, accepted at XVIII Congresso Brasileiro de Informática em Saúde (CBIS 2021)
Abstract: In this work, deep learning techniques for brain age prediction from magnetic resonance images are investigated, aiming to assist in the identification of biomarkers of the natural aging process. The identification of biomarkers is useful for detecting an early-stage neurodegenerative process, as well as for predicting age-related or non-age-related cognitive decline. Two techniques are implemented and compared in this work: a 3D Convolutional Neural Network applied to the volumetric image and a 2D Convolutional Neural Network applied to slices from the axial plane, with subsequent fusion of individual predictions. The best result was obtained by the 2D model, which achieved a mean absolute error of 3.83 years.
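
As a rough illustration of the 2D pipeline's final steps, the sketch below fuses per-slice age predictions into one prediction per subject and scores the result with mean absolute error. The abstract does not say how the fusion is done, so simple averaging is an assumption, and the numbers and function names are hypothetical.

```python
from statistics import mean

def fuse_slice_predictions(slice_ages):
    """Combine per-slice age predictions for one subject by averaging.

    Averaging is an assumed fusion rule; the paper may use a different one.
    """
    return mean(slice_ages)

def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and true ages."""
    return mean([abs(p - a) for p, a in zip(predicted, actual)])

# Hypothetical per-slice outputs of a 2D model for two subjects, in years.
subject_slices = [[70.0, 72.0, 74.0], [60.0, 64.0]]
true_ages = [70.0, 65.0]

subject_preds = [fuse_slice_predictions(s) for s in subject_slices]
mae = mean_absolute_error(subject_preds, true_ages)
```

The reported figure of 3.83 years is this kind of per-subject mean absolute error, computed over the study's test subjects rather than toy values.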

【5】 On the relationship between calibrated predictors and unbiased volume estimation
Link: https://arxiv.org/abs/2112.12560

Authors: Teodora Popordanoska, Jeroen Bertels, Dirk Vandermeulen, Frederik Maes, Matthew B. Blaschko
Affiliations: Center for Processing Speech and Images, Dept. ESAT, KU Leuven, Belgium
Comments: Published at MICCAI 2021
Abstract: Machine learning driven medical image segmentation has become standard in medical image analysis. However, deep learning models are prone to overconfident predictions. This has led to a renewed focus on calibrated predictions in the medical imaging and broader machine learning communities. Calibrated predictions are estimates of the probability of a label that correspond to the true expected value of the label conditioned on the confidence. Such calibrated predictions have utility in a range of medical imaging applications, including surgical planning under uncertainty and active learning systems. At the same time, it is often an accurate volume measurement that is of real importance for many medical applications. This work investigates the relationship between model calibration and volume estimation. We demonstrate both mathematically and empirically that if the predictor is calibrated per image, we can obtain the correct volume by taking an expectation of the probability scores per pixel/voxel of the image. Furthermore, we show that convex combinations of calibrated classifiers preserve volume estimation, but do not preserve calibration. Therefore, we conclude that having a calibrated predictor is a sufficient, but not necessary, condition for obtaining an unbiased estimate of the volume. We validate our theoretical findings empirically on a collection of 18 different (calibrated) training strategies on the tasks of glioma volume estimation on the BraTS 2018 dataset and ischemic stroke lesion volume estimation on the ISLES 2018 dataset.
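
The link between calibration and volume can be seen in a small simulation. This is a toy sketch, not the paper's experiment: the predictor is made calibrated by construction (each voxel label is drawn as a Bernoulli sample from its predicted probability), and the probability levels, voxel count, and function names are all assumptions chosen for illustration.

```python
import random

def soft_volume(probs, voxel_volume=1.0):
    """Volume estimate from the sum of per-voxel foreground probabilities."""
    return voxel_volume * sum(probs)

def hard_volume(probs, threshold=0.5, voxel_volume=1.0):
    """Volume estimate from counting voxels at or above a threshold."""
    return voxel_volume * sum(1 for p in probs if p >= threshold)

def simulate(n_voxels=20000, seed=0):
    """Toy calibrated predictor: the label of each voxel is Bernoulli(p)."""
    rng = random.Random(seed)
    probs = [rng.choice([0.1, 0.2, 0.4, 0.9]) for _ in range(n_voxels)]
    labels = [1 if rng.random() < p else 0 for p in probs]
    return probs, labels

probs, labels = simulate()
true_volume = sum(labels)      # ground-truth foreground voxel count
soft = soft_volume(probs)      # close to true_volume when calibrated
hard = hard_volume(probs)      # ignores the 0.1/0.2/0.4 voxels entirely
```

In this setup the summed probabilities track the true voxel count closely, while thresholding at 0.5 undercounts because many voxels with probability below the threshold are in fact foreground.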

Machine-translated; for reference only.
